R Workflow

Author

Frank Harrell

Published

May 6, 2022

This article outlines an analysis project workflow that I’ve found efficient for making reproducible research reports using R with Rmarkdown and now Quarto. I start by covering the creation of annotated analysis files and the running of descriptive statistics on them, with the goals of understanding the data and assessing their quality and completeness. Functions in the Hmisc package are used to annotate data frames and data tables with labels and units of measurement and to produce tabular and graphical statistical summaries. Several examples of processing and manipulating data using the data.table package are given. Finally, examples of caching results and doing parallel processing are presented.

File Directory Structure

I have a directory for each major project, and put everything in that directory (including data files) except for graphics files for figures, which are placed in their own subdirectory underneath the project folder. The directory name for the project includes key identifying information, and files within that directory do not contain project information in their names, nor do they contain dates, unless I want to freeze an old version of an analysis script.

With Quarto I specify that html files are to be self-contained, so there are no separate graphics files.
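For reference, a minimal Quarto YAML header sketch that requests a self-contained html file (current Quarto uses the embed-resources option; older versions used self-contained: true):

```yaml
format:
  html:
    embed-resources: true
```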

For multi-analyst projects or ones in which you want to capture the entire code history, having the project on github is worthwhile.

Analysis File Creation

I typically create a compact analysis file in a separate R script called create.r and have it produce compressed R binary data.frame or data.table .rds files using saveRDS(name, 'name.rds', compress='xz'). Then I have an analysis script named for example a.qmd (for Quarto reports) or a.Rmd (for RMarkdown reports) that starts with d <- readRDS('name.rds').
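The split between creation and analysis scripts can be sketched as follows; the file and variable names here are hypothetical:

```r
# create.r -- build the annotated analysis file once
require(Hmisc)
d <- csv.get('rawdata.csv')              # import raw data (Hmisc csv.get)
d <- upData(d,
            labels = c(age = 'Age'),     # annotate as needed
            units  = c(age = 'years'))
saveRDS(d, 'name.rds', compress = 'xz')
```

Every analysis script then begins with `d <- readRDS('name.rds')`, so the import and annotation steps run only once.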

Templates for analysis reports are here, and a comprehensive report example may be found here.

When variables need to be recoded, have labels added or changed, or have units of measurement added, I specify those using the Hmisc package upData function. Here is an example:

Variable labels and units of measurement are used in special ways in my R packages. This will show up in the describe and contents function outputs below and in axis labels for graphics.
# Function to recode from atypical coding for yes/no in raw data
yn <- function(x) factor(x, 0:1, c('yes', 'no'))
d <-
  upData(d,
         rename = c(gender='sex', any.event='anyEvent'),
         posSE    = yn(posSE),
         newMI    = yn(newMI),
         newPTCA  = yn(newPTCA),
         newCABG  = yn(newCABG),
         death    = yn(death),
         hxofHT   = yn(hxofHT),
         hxofDM   = yn(hxofDM),
         hxofCig  = factor(hxofCig, c(0, 0.5, 1),
                           c('heavy', 'moderate', 'non-smoker')), 
         hxofMI   = yn(hxofMI),
         hxofPTCA = yn(hxofPTCA),
         hxofCABG = yn(hxofCABG),
         chestpain= yn(chestpain),
         anyEvent = yn(anyEvent),
         drop=Cs(event.no,phat,mics,deltaEF,  # Cs in Hmisc; auto quotes
           newpkmphr,gdpkmphr,gdmaxmphr,gddpeakdp,gdmaxdp,
           hardness),
         labels=c(
           bhr    ='Basal heart rate',
           basebp ='Basal blood pressure',
           basedp ='Basal Double Product bhr*basebp',
           age    ='Age',
           pkhr   ='Peak heart rate',
           sbp    ='Systolic blood pressure',
           dp     ='Double product pkhr*sbp',
           dose   ='Dose of dobutamine given',
           maxhr  ='Maximum heart rate',
           pctMphr='Percent maximum predicted heart rate achieved',
           mbp    ='Maximum blood pressure',
           dpmaxdo='Double product on max dobutamine dose',
           dobdose='Dobutamine dose at max double product',
           baseEF ='Baseline cardiac ejection fraction',
           dobEF  ='Ejection fraction on dobutamine', 
           chestpain='Chest pain', 
           ecg     ='Baseline electrocardiogram diagnosis',
           restwma ='Resting wall motion abnormality on echocardiogram', 
           posSE   ='Positive stress echocardiogram',
           newMI   ='New myocardial infarction',
           newPTCA ='Recent angioplasty',
           newCABG ='Recent bypass surgery', 
           hxofHT  ='History of hypertension', 
           hxofDM  ='History of diabetes',
           hxofMI  ='History of myocardial infarction',
           hxofCig ='History of smoking',
           hxofPTCA='History of angioplasty',
           hxofCABG='History of coronary artery bypass surgery',
           anyEvent='Death, newMI, newPTCA, or newCABG'),
         units=c(age='years', bhr='bpm', basebp='mmHg', basedp='bpm*mmHg',
           pkhr='mmHg', sbp='mmHg', dp='bpm*mmHg', maxhr='bpm',
           mbp='mmHg', dpmaxdo='bpm*mmHg', baseEF='%', dobEF='%',
           pctMphr='%', dose='mg', dobdose='mg')
         )

saveRDS(d, 'stressEcho.rds', compress='xz')

# Note that we could have automatically recoded all 0:1 variables
# if they were all to be treated identically:

for(x in names(d)) 
  if(all(d[[x]] %in% c(0,1))) d[[x]] <- yn(d[[x]])

If reading data exported from REDCap and placed into the project directory, I run the following to get rid of duplicate variables (REDCap produces factor and non-factor versions of the same variables) and to automatically convert dates to Date variables:

require(Hmisc)
getRs('importREDCap.r', put='source')  # source() code to define function
mydata <- importREDCap()  # by default operates on last downloaded export
saveRDS(mydata, 'mydata.rds', compress='xz')

When file names are not given to importREDCap, the function looks for the latest created .csv file and the .R file with the same prefix, and uses those. See this for more information.

Variable Naming

I prefer short but descriptive variable names. As exemplified above, I use variable labels and units to provide more information. For example I wouldn’t name the age variable age.at.enrollment.yrs but would name it age with a label of Age at Enrollment and with units of years. Short, clear names unclutter code, especially formulas in statistical models. One can always fetch a variable label while writing a program (e.g., by typing label(d$age) at the console) to check that you have the right variable, or put the data dictionary in a window for easy reference, as shown below.
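For example, using the annotated dataset created above, a quick console check confirms that a short name maps to the intended variable:

```r
require(Hmisc)
label(d$age)   # "Age"
units(d$age)   # "years"
```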

Hmisc package graphics and table making functions such as summaryM and summary.formula specially typeset units in a smaller font.

Data Dictionary

The Hmisc package contents function will provide a concise data dictionary. Here is an example using the permanent version of the dataset created above, which can be accessed with the Hmisc getHdata function. The top of the contents output has the number of levels for factor variables hyperlinked. Click on the number to go directly to the list of levels for that variable.

The permanent version of stressEcho coded binary variables as 0/1 instead of N/Y.
require(Hmisc)
getHdata(stressEcho)
d <- stressEcho
html(contents(d), levelType='table')

Data frame: d

558 observations and 31 variables, maximum # NAs: 0

Name       Labels                                             Units      Levels  Storage
bhr        Basal heart rate                                   bpm                integer
basebp     Basal blood pressure                               mmHg               integer
basedp     Basal Double Product bhr*basebp                    bpm*mmHg           integer
pkhr       Peak heart rate                                    mmHg               integer
sbp        Systolic blood pressure                            mmHg               integer
dp         Double product pkhr*sbp                            bpm*mmHg           integer
dose       Dose of dobutamine given                           mg                 integer
maxhr      Maximum heart rate                                 bpm                integer
pctMphr    Percent maximum predicted heart rate achieved      %                  integer
mbp        Maximum blood pressure                             mmHg               integer
dpmaxdo    Double product on max dobutamine dose              bpm*mmHg           integer
dobdose    Dobutamine dose at max double product              mg                 integer
age        Age                                                years              integer
gender                                                                   2       integer
baseEF     Baseline cardiac ejection fraction                 %                  integer
dobEF      Ejection fraction on dobutamine                    %                  integer
chestpain  Chest pain                                                            integer
restwma    Resting wall motion abnormality on echocardiogram                     integer
posSE      Positive stress echocardiogram                                        integer
newMI      New myocardial infarction                                             integer
newPTCA    Recent angioplasty                                                    integer
newCABG    Recent bypass surgery                                                 integer
death                                                                            integer
hxofHT     History of hypertension                                               integer
hxofDM     History of diabetes                                                   integer
hxofCig    History of smoking                                            3       integer
hxofMI     History of myocardial infarction                                      integer
hxofPTCA   History of angioplasty                                                integer
hxofCABG   History of coronary artery bypass surgery                             integer
any.event  Death, newMI, newPTCA, or newCABG                                     integer
ecg        Baseline electrocardiogram diagnosis                          3       integer

Variable   Levels
gender     male, female
hxofCig    heavy, moderate, non-smoker
ecg        normal, equivocal, MI
You can write the text output of contents into a text file in your current working directory, and click on that file in the RStudio Files window to create a new tab in the editor panel where you can view the data dictionary at any time. This is especially helpful if you need a reminder of variable definitions that are stored in the variable labels. Here is an example where the formatted data dictionary is saved.

Users having the xless system command installed can pop up a contents window at any time by typing xless(contents(d)) in the console. xless is in Hmisc.
capture.output(contents(d), file='contents.txt')

Better still, put the html version of the data dictionary into a small browser window to which you can refer at any point in analysis coding.

cat(html(contents(d)), file='contents.html')
browseURL('contents.html', browser='vivaldi -new-window')

In some cases it is best to have a browser window open to the full descriptive statistics for a data table/frame (see below; the describe function also shows labels, units, and levels).

For either approach it would be easy to have multiple tabs open, one tab for each of a series of data tables.

Descriptive Statistics

The Hmisc describe function is my main tool for getting initial descriptive statistics and for quality-controlling the data in a univariate fashion. Here is an example. The Info index measures the information content in a numeric variable relative to the information in a continuous numeric variable with no ties. A very low value of Info will occur when a binary variable is highly imbalanced. Clicking on Glossary on the right will pop up a browser window with a more in-depth glossary of terms used in Hmisc package output. It links to hbiostat.org/R/glossary.html, which you can link to from your own reports that use Hmisc.

Info comes from the approximate formula for the variance of a log odds ratio for a proportional odds model/Wilcoxon test, due to Whitehead.

Gmd in the output stands for Gini’s mean difference—the mean absolute difference over all possible pairs of different observations. It is a very interpretable measure of dispersion that is more robust than the standard deviation.
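Gini’s mean difference is simple to compute directly. Here is a minimal sketch (the function name gmd is made up for illustration; Hmisc supplies GiniMd for real use):

```r
# Mean of |xi - xj| over all n(n-1) ordered pairs of different observations
gmd <- function(x) {
  n <- length(x)
  sum(abs(outer(x, x, '-'))) / (n * (n - 1))
}
gmd(c(1, 2, 4))   # pairwise differences 1, 3, 2; mean = 2
```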

w <- describe(d)
html(w)
d

31 Variables   558 Observations

(In the html report each continuous variable is also accompanied by a spike histogram, omitted here.)

bhr : Basal heart rate [bpm]
       n  missing distinct     Info     Mean      Gmd      .05      .10      .25      .50      .75      .90      .95
     558        0       69    0.999    75.29    16.57     54.0     58.0     64.0     74.0     84.0     95.3    102.0
lowest : 42 44 45 46 47 , highest: 108 115 116 127 210

basebp : Basal blood pressure [mmHg]
       n  missing distinct     Info     Mean      Gmd      .05      .10      .25      .50      .75      .90      .95
     558        0       94    0.998    135.3    23.35    104.0    110.0    120.0    133.0    150.0    162.3    170.1
lowest : 85 88 90 97 98 , highest: 192 194 195 201 203

basedp : Basal Double Product bhr*basebp [bpm*mmHg]
       n  missing distinct     Info     Mean      Gmd      .05      .10      .25      .50      .75      .90      .95
     558        0      441        1    10181     2813     6607     7200     8400     9792    11663    13610    14770
lowest : 5000 5220 5280 5400 5460 , highest: 17604 17710 17748 21082 27300

pkhr : Peak heart rate [mmHg]
       n  missing distinct     Info     Mean      Gmd      .05      .10      .25      .50      .75      .90      .95
     558        0      105        1    120.6    25.36    81.85    90.70   106.25   122.00   135.00   147.00   155.15
lowest : 52 61 62 63 66 , highest: 170 171 176 182 210

sbp : Systolic blood pressure [mmHg]
       n  missing distinct     Info     Mean      Gmd      .05      .10      .25      .50      .75      .90      .95
     558        0      142        1    146.9    40.72       96      102      120      141      170      200      210
lowest : 40 60 70 79 80 , highest: 240 250 274 283 309

dp : Double product pkhr*sbp [bpm*mmHg]
       n  missing distinct     Info     Mean      Gmd      .05      .10      .25      .50      .75      .90      .95
     558        0      508        1    17634     5765    10256    11341    14033    17060    20644    24536    26637
lowest : 5100 5940 7490 8100 8360 , highest: 32518 33400 33840 38205 45114

dose : Dose of dobutamine given [mg]
       n  missing distinct     Info     Mean      Gmd
     558        0        7    0.843    33.75    8.334
lowest : 10 15 20 25 30 , highest: 20 25 30 35 40
 Value         10    15    20    25    30    35    40
 Frequency      2    28    47    56    64    61   300
 Proportion 0.004 0.050 0.084 0.100 0.115 0.109 0.538

maxhr : Maximum heart rate [bpm]
       n  missing distinct     Info     Mean      Gmd      .05      .10      .25      .50      .75      .90      .95
     558        0      103        1    119.4    24.64     82.0     91.0    104.2    120.0    133.0    146.0    154.1
lowest : 58 62 63 66 67 , highest: 170 171 176 182 200

pctMphr : Percent maximum predicted heart rate achieved [%]
       n  missing distinct     Info     Mean      Gmd      .05      .10      .25      .50      .75      .90      .95
     558        0       78    0.999    78.57    16.86       53       60       69       78       88       97      104
lowest : 38 39 40 41 42 , highest: 116 117 126 132 133

mbp : Maximum blood pressure [mmHg]
       n  missing distinct     Info     Mean      Gmd      .05      .10      .25      .50      .75      .90      .95
     558        0      132    0.999      156    35.03    110.0    120.0    133.2    150.0    175.8    200.0    211.1
lowest : 84 90 92 93 96 , highest: 240 250 274 283 309

dpmaxdo : Double product on max dobutamine dose [bpm*mmHg]
       n  missing distinct     Info     Mean      Gmd      .05      .10      .25      .50      .75      .90      .95
     558        0      484        1    18550     5385    11346    12865    15260    18118    21239    24893    27477
lowest : 7130 8100 8360 9240 9280 , highest: 32518 33400 33840 38205 45114

dobdose : Dobutamine dose at max double product [mg]
       n  missing distinct     Info     Mean      Gmd
     558        0        8    0.941    30.24    10.55
lowest : 5 10 15 20 25 , highest: 20 25 30 35 40
 Value          5    10    15    20    25    30    35    40
 Frequency      7     7    55    73    71    78    62   205
 Proportion 0.013 0.013 0.099 0.131 0.127 0.140 0.111 0.367

age : Age [years]
       n  missing distinct     Info     Mean      Gmd      .05      .10      .25      .50      .75      .90      .95
     558        0       62    0.999    67.34    13.41    46.85    51.00    60.00    69.00    75.00    82.00    85.00
lowest : 26 28 29 30 33 , highest: 89 90 91 92 93

gender
       n  missing distinct
     558        0        2
 Value        male female
 Frequency     220    338
 Proportion  0.394  0.606

baseEF : Baseline cardiac ejection fraction [%]
       n  missing distinct     Info     Mean      Gmd      .05      .10      .25      .50      .75      .90      .95
     558        0       54    0.994     55.6    10.71       32       40       52       57       62       65       66
lowest : 20 21 22 23 25 , highest: 74 75 77 79 83

dobEF : Ejection fraction on dobutamine [%]
       n  missing distinct     Info     Mean      Gmd      .05      .10      .25      .50      .75      .90      .95
     558        0       60    0.992    65.24    12.38     40.0     49.7     62.0     67.0     73.0     76.0     80.0
lowest : 23 25 26 27 28 , highest: 86 87 89 90 94

chestpain : Chest pain
       n  missing distinct     Info      Sum     Mean      Gmd
     558        0        2     0.64      172   0.3082   0.4272

restwma : Resting wall motion abnormality on echocardiogram
       n  missing distinct     Info      Sum     Mean      Gmd
     558        0        2    0.745      257   0.4606   0.4978

posSE : Positive stress echocardiogram
       n  missing distinct     Info      Sum     Mean      Gmd
     558        0        2    0.553      136   0.2437   0.3693

newMI : New myocardial infarction
       n  missing distinct     Info      Sum     Mean      Gmd
     558        0        2    0.143       28  0.05018  0.09549

newPTCA : Recent angioplasty
       n  missing distinct     Info      Sum     Mean      Gmd
     558        0        2    0.138       27  0.04839  0.09226

newCABG : Recent bypass surgery
       n  missing distinct     Info      Sum     Mean      Gmd
     558        0        2    0.167       33  0.05914   0.1115

death
       n  missing distinct     Info      Sum     Mean      Gmd
     558        0        2    0.123       24  0.04301  0.08247

hxofHT : History of hypertension
       n  missing distinct     Info      Sum     Mean      Gmd
     558        0        2    0.625      393   0.7043   0.4173

hxofDM : History of diabetes
       n  missing distinct     Info      Sum     Mean      Gmd
     558        0        2    0.699      206   0.3692   0.4666

hxofCig : History of smoking
       n  missing distinct
     558        0        3
 Value           heavy   moderate non-smoker
 Frequency         122        138        298
 Proportion      0.219      0.247      0.534

hxofMI : History of myocardial infarction
       n  missing distinct     Info      Sum     Mean      Gmd
     558        0        2    0.599      154    0.276   0.4004

hxofPTCA : History of angioplasty
       n  missing distinct     Info      Sum     Mean      Gmd
     558        0        2    0.204       41  0.07348   0.1364

hxofCABG : History of coronary artery bypass surgery
       n  missing distinct     Info      Sum     Mean      Gmd
     558        0        2    0.399       88   0.1577   0.2661

any.event : Death, newMI, newPTCA, or newCABG
       n  missing distinct     Info      Sum     Mean      Gmd
     558        0        2    0.402       89   0.1595   0.2686

ecg : Baseline electrocardiogram diagnosis
       n  missing distinct
     558        0        3
 Value         normal equivocal        MI
 Frequency        311       176        71
 Proportion     0.557     0.315     0.127

# To create a separate browser window:
cat(html(w), file='desc.html')
browseURL('desc.html', browser='firefox -new-window')

There is also a plot method for describe output. It produces two graphics objects: one for categorical variables and one for continuous variables. The default is to use ggplot2 to produce static graphics.

p <- plot(w, bvspace=2.5)
p$Categorical

p$Continuous

To make creation of tabs in Quarto easier, here is a function that loops through the elements of a named list and outputs each element into a separate Quarto tab. A wide argument is used to expand the width of the output outside the usual margins. An initblank argument creates a first tab that is empty. This allows one to show nothing until one of the other tabs is clicked.

This function is in Github and can be defined using the Hmisc getRs function, i.e., getRs('maketabs.r', put='source').
maketabs <- function(x, labels=names(x), wide=FALSE, initblank=FALSE) {
  yaml <- paste0('.panel-tabset', if(wide) ' .column-page')
  k <- c('', '::: {', yaml, '}', '')
  if(initblank) k <- c(k, '', '##   ', '')
  .objmaketabs. <<- x
  for(i in 1 : length(x)) {
    cname <- paste0('c', round(100000 * runif(1)))
    k <- c(k, '', paste('##', labels[i]), '',
           paste0('```{r ', cname, ',results="asis",echo=FALSE}'),
           paste0('.objmaketabs.[[', i, ']]'), '```', '')
  }
  k <- c(k, ':::', '')
  cat(knitr::knit(text=knitr::knit_expand(text=k), quiet=TRUE))
}
maketabs(p)

By specifying the grType option you can instead get plotly graphics that use hover text to show more information, especially when hovering over the leftmost dot or tick mark for a variable.

options(grType='plotly')
p <- plot(w, bvspace=2.5)
maketabs(p, wide=TRUE)

See this for other Hmisc functions for descriptive graphics and tables, especially for stratified descriptive statistics for categorical variables. The summaryM function prints a tabular summary of a mix of continuous and categorical variables. Here is an example where stratification is by history of myocardial infarction (MI).

The initblank option to the maketabs function creates the initially opened tab, which has no content, thereby leaving it to the user to click one of the other tabs before output is shown.
require(data.table)
setDT(d)   # turn d into a data table
w <- d     # note: w is an alias for d, not a copy, so := below also changes d
w[, hxofMI := factor(hxofMI, 0 : 1, c('No history of MI', 'History of MI'))]
vars <- setdiff(names(d), 'hxofMI')
form <- as.formula(paste(paste(vars, collapse='+'), '~ hxofMI'))
print(form)

bhr + basebp + basedp + pkhr + sbp + dp + dose + maxhr + pctMphr + mbp + dpmaxdo + dobdose + age + gender + baseEF + dobEF + chestpain + restwma + posSE + newMI + newPTCA + newCABG + death + hxofHT + hxofDM + hxofCig + hxofPTCA + hxofCABG + any.event + ecg ~ hxofMI

s <- summaryM(form, data=d, test=TRUE)
w <- list('Table 1'                   = html(s, exclude1=TRUE, npct='both', digits=3, middle.bold=TRUE),
          'Categorical Variable Plot' = plot(s, which='categorical', vars=1 : 4, height=600, width=1000),
          'Continuous Variable Plot'  = plot(s, which='continuous',  vars=1 : 4))
# initblank=TRUE: Put an empty tab first so that nothing initially displays
maketabs(w, wide=TRUE, initblank=TRUE)
Descriptive Statistics (N=558)

                                                     No history of MI     History of MI
                                                     N=404                N=154                Test Statistic
Basal heart rate [bpm]                               65 74 85             63 72 84             F(1,556)=1.41,  P=0.235 [1]
Basal blood pressure [mmHg]                          120 134 150          120 130 150          F(1,556)=1.39,  P=0.238 [1]
Basal Double Product bhr*basebp [bpm*mmHg]           8514 9874 11766      8026 9548 11297      F(1,556)=3.32,  P=0.069 [1]
Peak heart rate [mmHg]                               107 123 136          104 120 132          F(1,556)=2.35,  P=0.126 [1]
Systolic blood pressure [mmHg]                       124 146 174          115 134 158          F(1,556)=12.1,  P<0.001 [1]
Double product pkhr*sbp [bpm*mmHg]                   14520 17783 21116    13198 15539 18885    F(1,556)=15,    P<0.001 [1]
Dose of dobutamine given [mg] : 10                   0.00  2/404          0.00  0/154          χ²(6)=8.77,     P=0.187 [2]
   15                                                0.05  21/404         0.05  7/154
   20                                                0.10  40/404         0.05  7/154
   25                                                0.11  45/404         0.07  11/154
   30                                                0.11  43/404         0.14  21/154
   35                                                0.11  45/404         0.10  16/154
   40                                                0.51  208/404        0.60  92/154
Maximum heart rate [bpm]                             107 122 134          102 118 130          F(1,556)=4.05,  P=0.045 [1]
Percent maximum predicted heart rate achieved [%]    69.0 78.0 89.0       70.0 77.0 87.5       F(1,556)=0.5,   P=0.479 [1]
Maximum blood pressure [mmHg]                        138 154 180          130 142 162          F(1,556)=13,    P<0.001 [1]
Double product on max dobutamine dose [bpm*mmHg]     15654 18666 21664    14489 16785 19680    F(1,556)=16.1,  P<0.001 [1]
Dobutamine dose at max double product [mg] : 5       0.01  4/404          0.02  3/154          χ²(7)=8.5,      P=0.29  [2]
   10                                                0.01  6/404          0.01  1/154
   15                                                0.11  43/404         0.08  12/154
   20                                                0.14  58/404         0.10  15/154
   25                                                0.13  54/404         0.11  17/154
   30                                                0.13  51/404         0.18  27/154
   35                                                0.10  40/404         0.14  22/154
   40                                                0.37  148/404        0.37  57/154
Age [years]                                          59.0 68.0 75.0       63.2 71.0 76.8       F(1,556)=9.75,  P=0.002 [1]
gender : female                                      0.68  273/404        0.42  65/154         χ²(1)=30,       P<0.001 [2]
Baseline cardiac ejection fraction [%]               55 59 63             40 54 60             F(1,556)=56.4,  P<0.001 [1]
Ejection fraction on dobutamine [%]                  65.0 70.0 74.2       50.0 64.5 70.0       F(1,556)=50.3,  P<0.001 [1]
Chest pain                                           0.29  119/404        0.34  53/154         χ²(1)=1.29,     P=0.257 [2]
Resting wall motion abnormality on echocardiogram    0.57  230/404        0.18  27/154         χ²(1)=69.7,     P<0.001 [2]
Positive stress echocardiogram                       0.21  86/404         0.32  50/154         χ²(1)=7.56,     P=0.006 [2]
New myocardial infarction                            0.03  14/404         0.09  14/154         χ²(1)=7.4,      P=0.007 [2]
Recent angioplasty                                   0.02  10/404         0.11  17/154         χ²(1)=17.8,     P<0.001 [2]
Recent bypass surgery                                0.05  21/404         0.08  12/154         χ²(1)=1.35,     P=0.246 [2]
death                                                0.04  15/404         0.06  9/154          χ²(1)=1.23,     P=0.267 [2]
History of hypertension                              0.69  280/404        0.73  113/154        χ²(1)=0.89,     P=0.346 [2]
History of diabetes                                  0.36  147/404        0.38  59/154         χ²(1)=0.18,     P=0.674 [2]
History of smoking : heavy                           0.21  83/404         0.25  39/154         χ²(2)=3.16,     P=0.206 [2]
   moderate                                          0.24  96/404         0.27  42/154
   non-smoker                                        0.56  225/404        0.47  73/154
History of angioplasty                               0.04  15/404         0.17  26/154         χ²(1)=28.4,     P<0.001 [2]
History of coronary artery bypass surgery            0.08  34/404         0.35  54/154         χ²(1)=59.6,     P<0.001 [2]
Death, newMI, newPTCA, or newCABG                    0.12  48/404         0.27  41/154         χ²(1)=18.1,     P<0.001 [2]
Baseline electrocardiogram diagnosis : normal        0.59  240/404        0.46  71/154         χ²(2)=8,        P=0.018 [2]
   equivocal                                         0.29  117/404        0.38  59/154
   MI                                                0.12  47/404         0.16  24/154

a b c represent the lower quartile a, the median b, and the upper quartile c for continuous variables.
Tests used: [1] Wilcoxon test; [2] Pearson test.

Semi-interactive stratified spike histograms are also useful descriptive plots. These plots also contain a superset of the quantiles used in box plots, and the legend is clickable, allowing any of the statistical summaries to be turned off.

d[, histboxp(x=maxhr, group=ecg, bins=200)]

Data Manipulation and Aggregation

The data.table package provides a concise, consistent syntax for managing simple and complex data manipulation tasks, and it is extremely efficient for large datasets. One of the best organized tutorials is this, and a master cheatsheet for data.table is here, from which the general syntax below is taken, where DT represents a data table.

   DT[ i,  j,  by ] # + extra arguments
        |   |   |
        |   |    -------> grouped by what?
        |    -------> what to do?
         ---> on which rows?

Data tables are also data frames, so any method that handles data frames also works on data tables. But data tables do not have row names.
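To make the general DT[i, j, by] template concrete, here is one application to the stress echo data (assuming d has been converted to a data table with setDT, as is done in the descriptive statistics section above):

```r
# i : restrict to females;  j : compute mean age;  by : group by ecg category
d[gender == 'female', .(meanAge = mean(age)), by = ecg]
```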

Several data.table examples follow. I like to hold the current dataset in d to save typing. To facilitate some operations requiring variable names to be quoted, define a function .q to quote them automatically (like the Hmisc function Cs).

.q <- function(...) as.character(sys.call())[-1]
.q(a, b, c, 'this and that')
[1] "a"             "b"             "c"             "this and that"

Analyzing Selected Variables and Subsets

d[, html(describe(age))]
age : Age [years]
       n  missing distinct     Info     Mean      Gmd      .05      .10      .25      .50      .75      .90      .95
     558        0       62    0.999    67.34    13.41    46.85    51.00    60.00    69.00    75.00    82.00    85.00
lowest : 26 28 29 30 33 , highest: 89 90 91 92 93
d[, html(describe(~ age + gender))]
age + gender

2 Variables   558 Observations

age : Age [years]
       n  missing distinct     Info     Mean      Gmd      .05      .10      .25      .50      .75      .90      .95
     558        0       62    0.999    67.34    13.41    46.85    51.00    60.00    69.00    75.00    82.00    85.00
lowest : 26 28 29 30 33 , highest: 89 90 91 92 93

gender
       n  missing distinct
     558        0        2
 Value        male female
 Frequency     220    338
 Proportion  0.394  0.606

d[gender == 'female', html(describe(age))]   # analyze age for females
age : Age [years]
       n  missing distinct     Info     Mean      Gmd      .05      .10      .25      .50      .75      .90      .95
     338        0       58    0.999    67.01    13.74    47.00    50.70    59.25    68.00    76.00    83.00    85.00
lowest : 26 28 29 33 34 , highest: 87 88 89 90 91
html(describe(d[, .(age, gender)], 'Age and Gender Stats'))
Age and Gender Stats

2 Variables   558 Observations

age : Age [years]
       n  missing distinct     Info     Mean      Gmd      .05      .10      .25      .50      .75      .90      .95
     558        0       62    0.999    67.34    13.41    46.85    51.00    60.00    69.00    75.00    82.00    85.00
lowest : 26 28 29 30 33 , highest: 89 90 91 92 93

gender
       n  missing distinct
     558        0        2
 Value        male female
 Frequency     220    338
 Proportion  0.394  0.606

# Separate analysis by female, male.  Use keyby instead of by to sort the usual way.
d[, print(describe(age, descript=gender)), by=gender]
male : Age [years] 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
     220        0       51    0.999    67.86    12.91    45.95    51.00 
     .25      .50      .75      .90      .95 
   62.00    69.00    75.00    81.00    86.00 

lowest : 30 34 38 40 43, highest: 88 89 91 92 93
female : Age [years] 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
     338        0       58    0.999    67.01    13.74    47.00    50.70 
     .25      .50      .75      .90      .95 
   59.25    68.00    76.00    83.00    85.00 

lowest : 26 28 29 33 34, highest: 87 88 89 90 91
Empty data.table (0 rows and 1 cols): gender
# Compute mean and median age by gender
d[, .(Mean=mean(age), Median=median(age)), by=gender]
   gender     Mean Median
1:   male 67.85909     69
2: female 67.00888     68
# To create a new subset
w <- d[gender == 'female' & age < 70, ]

Adding or Changing Variables

With data.table you can create a new data table with added variables, or you can add or redefine variables in a data table in place. The latter has major speed and memory efficiency implications when processing massive data tables. Here d refers to a different data table from the one used above.

# Rename a variable
setnames(d, c('gender', 'height'),
            c(  'sex',      'ht'))
# Easier:
setnames(d, .q(gender, height),
            .q(   sex,     ht))

# Add two new variables, storing result in a new data table
z <- d[, .(bmi=wt / ht ^ 2, bsa=0.016667 * sqrt(wt * ht))]

# Add one new variable in place
d[, bmi := wt / ht ^ 2]

# Add two new variables in place
d[, `:=`(bmi = wt / ht ^ 2, bsa=0.016667 * sqrt(wt * ht))]
d[, .q(bmi, bsa) := .(wt / ht ^ 2, 0.016667 * sqrt(wt * ht))]

# Compute something requiring a different formula for different types
# of observations
d[, htAdj := ifelse(sex == 'male', ht, ht * 1.07)]
d[, htAdj := ht * ifelse(sex == 'male', 1, 1.07)]
d[, htAdj := (sex == 'male') * ht + (sex == 'female') * ht * 1.07]
d[sex == 'male',   htAdj := ht]
d[sex == 'female', htAdj := ht * 1.07]

# Add label & optional units (better to use upData which works on data tables)
adlab <- function(x, lab, un='') {
  label(x) <- lab
  if(un != '') units(x) <- un
  x
}
d[, maxhr := adlab(maxhr, 'Maximum Heart Rate', '/m')]

# Delete a variable (or use upData)
d[, bsa := NULL]

# Delete two variables
d[, `:=`(bsa=NULL, bmi=NULL)]
d[, .q(bsa, bmi) := NULL]

Recoding Variables

# Group levels a1, a2 as a & b1,b2,b3 as b
d[, x := factor(x, .q(a1,a2,b1,b2,b3),
                   .q( a, a, b, b, b))]
# Regroup an existing factor variable
levels(d$x) <- list(a=.q(a1,a2), b=.q(b1,b2,b3))
# Or manipulate character strings
d[, x := substring(x, 1, 1)]   # replace x with first character of levels
# or
levels(d$x) <- substring(levels(d$x), 1, 1)

# Recode a numeric variable with values 0, 1, 2, 3, 4 to 0, 1, 1, 1, 2
d[, x := 1 * (x %in% 1:3) + 2 * (x == 4)]

# Recode a series of conditions to a factor variable whose value is taken
# from the last condition that is TRUE using Hmisc::score.binary
# Result is a factor variable unless you add retfactor=FALSE
d[, x := score.binary(x1 == 'symptomatic',
                      x2 %in% c('stroke', 'MI'),
                      death)]
# Same but code with numeric points
d[, x := score.binary(x1 == 'symptomatic',
                      x2 %in% c('stroke', 'MI'),
                      death,  # TRUE/FALSE or 1/0 variable
                      points=c(1,2,10))]

# Recode from one set of character strings to another using named vector
# A named vector can be subscripted with character strings as well as integers
states <- c(AL='Alabama', AK='Alaska', AZ='Arizona', ...)
# Could also do:
#  states <- .q(Alabama, Alaska, Arizona, ..., 'New Mexico', ...)
#  names(states) <- .q(AL, AK, AZ, ...)
# or do a merge for table lookup (see later)
d[, State := states[state]]
# states are unique, state can have duplicates and all are recoded

# Recode from integers 1, 2, ..., to character strings
labs <- .q(elephant, giraffe, dog, cat)
d[, x := labs[x]]

# Recode from character strings to integers
d[, x := match(x, labs)]

Computing Summary Statistics

Many applications can use the automatically created data.table object .SD, which stands for the subset of the data table for the current group being processed. If .SDcols is not specified, analysis of all variables is attempted. Specify a vector of variable names as .SDcols to restrict the analysis. If there are no by variable(s), .SD stands for the whole data table.

# Compute the number of distinct values for all variables
nd <- function(x) length(unique(x))
d[, sapply(.SD, nd)]
      bhr    basebp    basedp      pkhr       sbp        dp      dose     maxhr 
       69        94       441       105       142       508         7       103 
  pctMphr       mbp   dpmaxdo   dobdose       age    gender    baseEF     dobEF 
       78       132       484         8        62         2        54        60 
chestpain   restwma     posSE     newMI   newPTCA   newCABG     death    hxofHT 
        2         2         2         2         2         2         2         2 
   hxofDM   hxofCig    hxofMI  hxofPTCA  hxofCABG any.event       ecg 
        2         3         2         2         2         2         3 
# Same but only for variables whose names contain hx and either D or M
d[, sapply(.SD, nd), .SDcols=patterns('hx', 'D|M')]
hxofDM hxofMI 
     2      2 
# Compute means on all numeric variables
mn <- function(x) mean(x, na.rm=TRUE)
d[, lapply(.SD, mn), .SDcols=is.numeric]
        bhr   basebp   basedp     pkhr      sbp       dp     dose    maxhr
1: 75.29032 135.3244 10181.31 120.5502 146.9158 17633.84 33.75448 119.3692
    pctMphr mbp  dpmaxdo  dobdose      age   baseEF    dobEF chestpain
1: 78.56989 156 18549.88 30.24194 67.34409 55.60394 65.24194 0.3082437
     restwma     posSE      newMI   newPTCA    newCABG      death    hxofHT
1: 0.4605735 0.2437276 0.05017921 0.0483871 0.05913978 0.04301075 0.7043011
      hxofDM  hxofPTCA  hxofCABG any.event
1: 0.3691756 0.0734767 0.1577061 0.1594982
# Compute means on all numeric non-binary variables
nnb <- function(x) is.numeric(x) && length(unique(x)) > 2
d[, lapply(.SD, mn), .SDcols=nnb]
        bhr   basebp   basedp     pkhr      sbp       dp     dose    maxhr
1: 75.29032 135.3244 10181.31 120.5502 146.9158 17633.84 33.75448 119.3692
    pctMphr mbp  dpmaxdo  dobdose      age   baseEF    dobEF
1: 78.56989 156 18549.88 30.24194 67.34409 55.60394 65.24194
# Print frequency tables of all categorical variables with > 2 levels
cmult <- function(x) ! is.numeric(x) && length(unique(x)) > 2
tab <- function(x) {
  z <- table(x, useNA='ifany')
  paste(paste0(names(z), ': ', z), collapse=', ')
}
d[, lapply(.SD, tab), .SDcols=cmult]
                                      hxofCig
1: heavy: 122, moderate: 138, non-smoker: 298
                                   ecg
1: normal: 311, equivocal: 176, MI: 71

Next, tabulate all variables having between 3 and 10 distinct values. While data.table runs, the summarization function tab has a side effect: it stores all values and frequencies in a growing list Z. After padding each column to the maximum length over all columns (the maximum number of distinct values), kable can render Z as a markdown table.

# Using <<- makes data.table have a side effect of augmenting Z and
# Align in the global environment
tab <- function(x) {
  z <- table(x, useNA='ifany')
  i <- length(Z)
  Z[[i+1]] <<- names(z)
  Z[[i+2]] <<- as.vector(z)
  Align <<- c(Align, if(is.numeric(x)) 'r' else 'l', 'r')
  length(z)
}
discr <- function(x) { i <- length(unique(x)); i > 2 & i < 11 }
Z    <- list(); Align <- character(0)
w    <- d[, lapply(.SD, tab), .SDcols=discr]
maxl <- max(w)
# Pad shorter vectors with blanks
Z <- lapply(Z, function(x) c(x, rep('', maxl - length(x))))
Z <- do.call('cbind', Z)  # combine all into columns of a matrix
colnames(Z) <- rep(names(w), each=2)
colnames(Z)[seq(2, ncol(Z), by=2)] <- 'Freq'
knitr::kable(Z, align=Align)
dose  Freq   dobdose  Freq   hxofCig     Freq   ecg        Freq
  10     2         5     7   heavy        122   normal      311
  15    28        10     7   moderate     138   equivocal   176
  20    47        15    55   non-smoker   298   MI           71
  25    56        20    73
  30    64        25    71
  35    61        30    78
  40   300        35    62
                  40   205

A better approach is to let the kables function assemble a series of separate markdown tables of different sizes. By using the same "update Z in the global environment" side effect, we can let data.table collect objects with non-conformable dimensions across variables (such as frequency tabulations).

tab <- function(x) {
  z <- table(x, useNA='ifany')
  i <- length(Z)
  w <- matrix(cbind(names(z), z), ncol=2,
              dimnames=list(NULL, c(vnames[i+1], 'Freq')))
  Z[[i+1]] <<- knitr::kable(w, align=c(if(is.numeric(x)) 'r' else 'l', 'r'))
  length(z)
}
discr <- function(x) { i <- length(unique(x)); i > 2 & i < 11 }
Z      <- list()
vnames <- names(d[, .SD, .SDcols=discr])
w      <- d[, lapply(.SD, tab), .SDcols=discr]
knitr::kables(Z)
dose Freq
10 2
15 28
20 47
25 56
30 64
35 61
40 300
dobdose Freq
5 7
10 7
15 55
20 73
25 71
30 78
35 62
40 205
hxofCig Freq
heavy 122
moderate 138
non-smoker 298
ecg Freq
normal 311
equivocal 176
MI 71

Use a similar side-effect approach to get separate html describe output by gender.

g <- function(x, by) {
  Z[[length(Z) + 1]] <<- describe(x, descript=paste('age for', by))
  by
}
Z <- list()
by <- d[, g(age, gender), by=gender]
# Make Z look like describe() output for multiple variables
class(Z) <- 'describe'
attr(Z, 'dimensions') <- c(nrow(d), nrow(by))
attr(Z, 'descript') <- 'Age by Gender'
html(Z)
Age by Gender

2 Variables   558 Observations

age for male: Age [years]
[spike histogram]
       n  missing distinct   Info    Mean     Gmd    .05    .10    .25    .50    .75    .90    .95
     220        0       51  0.999   67.86   12.91  45.95  51.00  62.00  69.00  75.00  81.00  86.00
lowest : 30 34 38 40 43 , highest: 88 89 91 92 93

age for female: Age [years]
[spike histogram]
       n  missing distinct   Info    Mean     Gmd    .05    .10    .25    .50    .75    .90    .95
     338        0       58  0.999   67.01   13.74  47.00  50.70  59.25  68.00  76.00  83.00  85.00
lowest : 26 28 29 33 34 , highest: 87 88 89 90 91
# Compute a 1-valued statistic on multiple variables, by cross-classification
# of two variables.  Do this on a subset.  .SDcols=a:b uses variables in order
# Use keyby instead of by to order output the usual way
d[age < 70, lapply(.SD, mean), keyby=.(gender, newMI), .SDcols=pkhr:dp]
   gender newMI     pkhr      sbp       dp
1:   male     0 122.0561 147.7664 17836.24
2:   male     1 115.6667 140.6667 16437.67
3: female     0 122.8492 150.5084 18509.03
4: female     1 123.5714 171.5714 21506.71
# Compute multiple statistics on one variable
# Note: .N is a special variable: count of obs for current group
d[, .(Max=max(bhr), Min=min(bhr), Mean=mean(bhr), N=.N), by=.(gender, newMI)]
   gender newMI Max Min     Mean   N
1:   male     0 127  42 74.15385 208
2:   male     1  89  59 71.25000  12
3: female     0 210  45 75.96894 322
4: female     1 108  65 79.43750  16
# Same result another way
g <- function(x) list(Max=max(x), Min=min(x), Mean=mean(x), N=length(x))
d[, g(bhr), by=.(gender, newMI)]  # if g returned a vector instead, use as.list(g(bhr))
   gender newMI Max Min     Mean   N
1:   male     0 127  42 74.15385 208
2:   male     1  89  59 71.25000  12
3: female     0 210  45 75.96894 322
4: female     1 108  65 79.43750  16
d[, as.list(quantile(bhr)), by=gender]
   gender 0% 25% 50% 75% 100%
1:   male 42  63  72  83  127
2: female 45  65  75  85  210
# Compute multiple statistics on multiple variables
d[, lapply(.SD, quantile), by=gender, .SDcols=.q(bhr, pkhr, sbp)]
    gender bhr   pkhr    sbp
 1:   male  42  52.00  80.00
 2:   male  63 106.75 120.00
 3:   male  72 123.00 140.00
 4:   male  83 136.00 166.00
 5:   male 127 176.00 250.00
 6: female  45  61.00  40.00
 7: female  65 106.25 122.25
 8: female  75 122.00 141.50
 9: female  85 134.00 170.00
10: female 210 210.00 309.00
# Similar but put percentile number in front of statistic value
# Do only quartiles
g <- function(x) {
  z <- quantile(x, (1:3)/4, na.rm=TRUE)
  paste(format(names(z)), format(round(z)))
}
d[, lapply(.SD, g), by=gender, .SDcols=.q(bhr, pkhr, sbp)]
   gender    bhr    pkhr     sbp
1:   male 25% 63 25% 107 25% 120
2:   male 50% 72 50% 123 50% 140
3:   male 75% 83 75% 136 75% 166
4: female 25% 65 25% 106 25% 122
5: female 50% 75 50% 122 50% 142
6: female 75% 85 75% 134 75% 170
# To have more control over labeling and to have one row per sex:
g <- function(x) {
  s <- sapply(x, quantile, na.rm=TRUE)  # compute quantiles for all variables -> matrix
  h <- as.list(s)  # flattens the matrix into a list, column by column
  # Cross row names (percentile names) with column (variable) names
  # paste(b, a) puts variable name in front of percentile
  names(h) <- outer(rownames(s), colnames(s), function(a, b) paste(b, a))
  h
}
d[, g(.SD), by=gender, .SDcols=.q(bhr, pkhr, sbp)]
   gender bhr 0% bhr 25% bhr 50% bhr 75% bhr 100% pkhr 0% pkhr 25% pkhr 50%
1:   male     42      63      72      83      127      52   106.75      123
2: female     45      65      75      85      210      61   106.25      122
   pkhr 75% pkhr 100% sbp 0% sbp 25% sbp 50% sbp 75% sbp 100%
1:      136       176     80  120.00   140.0     166      250
2:      134       210     40  122.25   141.5     170      309
# Restrict to variables bhr - basedp in order columns created in data table
d[, g(.SD), by=gender, .SDcols=bhr : basedp]
   gender bhr 0% bhr 25% bhr 50% bhr 75% bhr 100% basebp 0% basebp 25%
1:   male     42      63      72      83      127        85     120.00
2: female     45      65      75      85      210        90     122.25
   basebp 50% basebp 75% basebp 100% basedp 0% basedp 25% basedp 50% basedp 75%
1:        130        145         203      5220    7976.25       9483   11183.50
2:        136        150         201      5000    8562.00      10063   11891.25
   basedp 100%
1:       16704
2:       27300
# Can put ! in front of a sequence of variables to do the opposite

# To add duplicated means to raw data use e.g.
# d[, Mean := mean(x), by=sex]
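A small self-contained sketch of that := pattern (toy data, not the stress-test dataset):

```r
library(data.table)
dt <- data.table(sex = c('m', 'm', 'f', 'f'), x = c(1, 3, 2, 6))
dt[, Mean := mean(x), by = sex]   # each row gets its own group's mean
dt$Mean                           # 2 2 4 4
```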

Merging Data

Consider a baseline dataset b and a longitudinal dataset L, with subject ID of id.

To merge any number of datasets at once and obtain a printed report of how the merge went, use the Hmisc Merge function.
b <- data.table(id=1:4, age=c(21, 28, 32, 23), key='id')
L <- data.table(id  = c(2, 2, 2, 3, 3, 3, 3, 4, 4, 5, 5, 5),
                day = c(1, 2, 3, 1, 2, 3, 4, 1, 2, 1, 2, 3),
                y    =  1 : 12, key='id')
b
   id age
1:  1  21
2:  2  28
3:  3  32
4:  4  23
L
    id day  y
 1:  2   1  1
 2:  2   2  2
 3:  2   3  3
 4:  3   1  4
 5:  3   2  5
 6:  3   3  6
 7:  3   4  7
 8:  4   1  8
 9:  4   2  9
10:  5   1 10
11:  5   2 11
12:  5   3 12
# Merge b and L to look up baseline age and associate it with all follow-ups
b[L, on=.(id)]   # Keeps all ids in L, looking up age from b (right join)
    id age day  y
 1:  2  28   1  1
 2:  2  28   2  2
 3:  2  28   3  3
 4:  3  32   1  4
 5:  3  32   2  5
 6:  3  32   3  6
 7:  3  32   4  7
 8:  4  23   1  8
 9:  4  23   2  9
10:  5  NA   1 10
11:  5  NA   2 11
12:  5  NA   3 12
L[b, on=.(id)]   # Keeps all ids in b (left join)
    id day  y age
 1:  1  NA NA  21
 2:  2   1  1  28
 3:  2   2  2  28
 4:  2   3  3  28
 5:  3   1  4  32
 6:  3   2  5  32
 7:  3   3  6  32
 8:  3   4  7  32
 9:  4   1  8  23
10:  4   2  9  23
L[b, on=.(id), nomatch=NULL]  # Keeps only ids in both b and L (inner join)
   id day y age
1:  2   1 1  28
2:  2   2 2  28
3:  2   3 3  28
4:  3   1 4  32
5:  3   2 5  32
6:  3   3 6  32
7:  3   4 7  32
8:  4   1 8  23
9:  4   2 9  23
uid <- unique(c(b[, id], L[, id]))
L[b[.(uid), on=.(id)]]         # Keeps ids in either b or L (full outer join)
    id day  y age
 1:  1  NA NA  21
 2:  2   1  1  28
 3:  2   2  2  28
 4:  2   3  3  28
 5:  3   1  4  32
 6:  3   2  5  32
 7:  3   3  6  32
 8:  3   4  7  32
 9:  4   1  8  23
10:  4   2  9  23
11:  5   1 10  NA
12:  5   2 11  NA
13:  5   3 12  NA
merge(b, L, by='id', all=TRUE) # also full outer join; calls merge.data.table
    id age day  y
 1:  1  21  NA NA
 2:  2  28   1  1
 3:  2  28   2  2
 4:  2  28   3  3
 5:  3  32   1  4
 6:  3  32   2  5
 7:  3  32   3  6
 8:  3  32   4  7
 9:  4  23   1  8
10:  4  23   2  9
11:  5  NA   1 10
12:  5  NA   2 11
13:  5  NA   3 12
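The Hmisc Merge function mentioned earlier wraps these operations for any number of datasets and prints a report of how the merge went. A sketch, assuming Hmisc is installed (b and L redefined here so the example is self-contained):

```r
require(Hmisc)
require(data.table)
b <- data.table(id=1:4, age=c(21, 28, 32, 23), key='id')   # as defined above
L <- data.table(id  = c(2, 2, 2, 3, 3, 3, 3, 4, 4, 5, 5, 5),
                day = c(1, 2, 3, 1, 2, 3, 4, 1, 2, 1, 2, 3),
                y   = 1 : 12, key='id')
# Full outer join on id; Merge also prints a table showing how many ids
# appeared in each input dataset and in the final merged result
M <- Merge(b, L, id = ~ id)
```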

For very large data tables, giving the data tables keys will speed execution, e.g.:

setkey(d, id)
setkey(d, state, city)
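Setting a key also enables fast binary-search subsetting and keyed joins. A small sketch (using a toy table dk rather than the analysis dataset):

```r
library(data.table)
dk <- data.table(id = c(3, 1, 2, 1), y = 1:4)
setkey(dk, id)   # physically sorts by id and marks it as the key
dk[.(1)]         # binary-search subset on the key: rows with id == 1
```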

Join/merge can be used for data lookups:

s <- data.table(st=.q(AL, AK, AZ, CA, OK), y=5:1)
stateAbbrevs <- data.table(state=state.abb, State=state.name)
s[stateAbbrevs, , on=.(st=state), nomatch=NULL]
   st y      State
1: AL 5    Alabama
2: AK 4     Alaska
3: AZ 3    Arizona
4: CA 2 California
5: OK 1   Oklahoma

Reshaping Data

To reshape data from long to wide format, take the L data table above as an example.

w <- dcast(L, id ~ day, value.var='y')
w
   id  1  2  3  4
1:  2  1  2  3 NA
2:  3  4  5  6  7
3:  4  8  9 NA NA
4:  5 10 11 12 NA
# Now reverse the procedure
m <- melt(w, id.vars='id', variable.name='day', value.name='y')
m
    id day  y
 1:  2   1  1
 2:  3   1  4
 3:  4   1  8
 4:  5   1 10
 5:  2   2  2
 6:  3   2  5
 7:  4   2  9
 8:  5   2 11
 9:  2   3  3
10:  3   3  6
11:  4   3 NA
12:  5   3 12
13:  2   4 NA
14:  3   4  7
15:  4   4 NA
16:  5   4 NA
setkey(m, id, day)   # sorts
m[! is.na(y)]
    id day  y
 1:  2   1  1
 2:  2   2  2
 3:  2   3  3
 4:  3   1  4
 5:  3   2  5
 6:  3   3  6
 7:  3   4  7
 8:  4   1  8
 9:  4   2  9
10:  5   1 10
11:  5   2 11
12:  5   3 12

Other Manipulations of Longitudinal Data

For the longitudinal data table L, carry each subject's last observation of y forward through day 4 if the last assessment occurred before day 4.

w <- copy(L)   # fresh start with no propagation of changes back to L
# Compute each record's within-subject sequence number and the last sequence number
# Feed this directly into a data.table operation to save last records 
# when the last is on a day < 4
u <- w[, .q(seq, maxseq) := .(1 : .N, .N), by=id][seq == maxseq & day < 4,]
# Extra observations to fill out to day 4
u <- u[, .(day = (day + 1) : 4, y = y), by=id]
u
   id day  y
1:  2   4  3
2:  4   3  9
3:  4   4  9
4:  5   4 12
w <- rbind(L, u, fill=TRUE)
setkey(w, id, day)  # sort and index
w
    id day  y
 1:  2   1  1
 2:  2   2  2
 3:  2   3  3
 4:  2   4  3
 5:  3   1  4
 6:  3   2  5
 7:  3   3  6
 8:  3   4  7
 9:  4   1  8
10:  4   2  9
11:  4   3  9
12:  4   4  9
13:  5   1 10
14:  5   2 11
15:  5   3 12
16:  5   4 12
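For numeric responses, the same carry-forward fill can be sketched more directly with a cross join (CJ) and data.table's nafill (L redefined here so the example is self-contained):

```r
library(data.table)
L <- data.table(id  = c(2, 2, 2, 3, 3, 3, 3, 4, 4, 5, 5, 5),   # as defined above
                day = c(1, 2, 3, 1, 2, 3, 4, 1, 2, 1, 2, 3),
                y   = 1 : 12)
# Expand to the full id x day grid (days 1-4), then carry the last
# observed y forward within each subject
w2 <- L[CJ(id = unique(L$id), day = 1 : 4), on = .(id, day)]
w2[, y := nafill(y, type = 'locf'), by = id]
```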

Find the first time at which y >= 3 and at which y >= 7

L[, .(first3 = min(day[y >= 3]),
      first7 = min(day[y >= 7])), by=id]
   id first3 first7
1:  2      3    Inf
2:  3      1      4
3:  4      1      1
4:  5      1      1

Same but instead of resulting in an infinite value if no observations for a subject meet the condition, make the result NA.

mn <- function(x) if(length(x)) min(x) else as.double(NA)
# as.double needed because day is stored as double precision
# (type contents(L) to see this) and data.table requires
# consistent storage types
L[, .(first3 = mn(day[y >= 3]),
      first7 = mn(day[y >= 7])), by=id]
   id first3 first7
1:  2      3     NA
2:  3      1      4
3:  4      1      1
4:  5      1      1

Analysis

Formatting

For analysis the sky is the limit. I take advantage of special formatting for model fit objects from the rms package by using html or latex methods and putting results='asis' in the chunk header to preserve the special formatting. Here is an example.

require(rms)
options(prType='html')  # needed to use special formatting (can use prType='latex')
set.seed(1)
n <- 1000
w <- data.table(x=runif(n), x2=rnorm(n), y=sample(0:1, n, replace=TRUE))
dd <- datadist(w); options(datadist='dd')  # rms needs for summaries and plotting
f <- lrm(y ~ x + rcs(x2, 4), data=w)
f

Logistic Regression Model

 lrm(formula = y ~ x + rcs(x2, 4), data = w)

                              Model Likelihood      Discrimination    Rank Discrim.
                                    Ratio Test             Indexes          Indexes
 Obs               1000     LR χ2         1.91     R2          0.003   C       0.526
  0                 492     d.f.             4     R2(4,1000)  0.000   Dxy     0.051
  1                 508     Pr(>χ2)     0.7516     R2(4,749.8) 0.000   γ       0.051
 max |∂log L/∂β| 2×10⁻¹⁴                           Brier       0.249   τa      0.026

                 β      S.E.   Wald Z   Pr(>|Z|)
 Intercept   0.1079   0.2976     0.36     0.7169
 x           0.1506   0.2197     0.69     0.4931
 x2          0.0502   0.2142     0.23     0.8147
 x2'        -0.3705   0.6631    -0.56     0.5763
 x2''        1.3624   2.6219     0.52     0.6033
anova(f)
Wald Statistics for y

                 χ2   d.f.        P
 x             0.47      1   0.4931
 x2            1.44      3   0.6961
  Nonlinear    0.33      2   0.8499
 TOTAL         1.91      4   0.7523

Caching

The workhorse behind Rmarkdown and Quarto is knitr, which processes the code chunks and properly mingles code with tabular and graphical output. knitr has a built-in caching mechanism so that code is not needlessly re-executed when its inputs have not changed. This easy-to-use process has two disadvantages: the dependencies are not transparent, and the stored cache files may be quite large. I like to take control of caching, and the runifChanged function was written to that end. Here is an example of its use. First a function with no arguments must be composed. This is the (usually slow) function that will be conditionally run if any of a group of listed objects has changed since the last time it was run. When it does need to run, the function produces an object that is stored in binary form in a user-specified file (the default file name is the name of the current R code chunk with .rds appended).
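The idea behind this kind of hash-based caching can be sketched with a hypothetical helper (not the actual runifChanged implementation, which also hashes function source code and derives the file name from the chunk name), assuming the digest package is available:

```r
# Hypothetical sketch of input-hash caching: re-run g() only when the
# hash of the declared dependencies differs from the stored one
cacheRun <- function(g, ..., file = 'cache.rds') {
  key <- digest::digest(list(...))          # hash of the declared dependencies
  if (file.exists(file)) {
    old <- readRDS(file)
    if (identical(old$key, key)) return(old$value)   # nothing changed; reuse
  }
  value <- g()                              # run the slow computation
  saveRDS(list(key = key, value = value), file, compress = 'xz')
  value
}
```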

# Read the source code for the hashCheck and runifChanged functions from
# https://github.com/harrelfe/rscripts/blob/master/hashCheck.r
getRs('hashCheck.r', put='source')
g <- function() {
  # Fit a logistic regression model and bootstrap it 500 times, saving
  # the matrix of bootstrapped coefficients
  f <- lrm(y ~ x1 + x2, x=TRUE, y=TRUE, data=dat)
  bootcov(f, B=500)
}
set.seed(3)
n   <- 2000
dat <- data.table(x1=runif(n), x2=runif(n),
                  y=sample(0:1, n, replace=TRUE))
# runifChanged will write runifch.rds if needed (default file name: <chunk name>.rds)
# Will run if dat or source code for lrm or bootcov change
b <- runifChanged(g, dat, lrm, bootcov)
dim(b$boot.Coef)
[1] 500   3
head(b$boot.Coef)
       Intercept          x1          x2
[1,]  0.02007292 -0.30079958  0.32416398
[2,]  0.06150624 -0.35741054  0.25522669
[3,]  0.25225861 -0.40094541  0.09290729
[4,]  0.13766665 -0.48661991  0.19684403
[5,] -0.22018456  0.02132711  0.33973578
[6,]  0.18217417 -0.36140896 -0.04873320

Parallel Computing

The runParallel function makes it easy to use available multiprocessor cores to speed up parallel computations, especially simulations. By default it uses one fewer than the number of available cores. runParallel makes the parallel package easier to use and recombines results over per-core batches. The user writes a function that does the work on one core, and the same function is run on all cores. This function has set arguments and must return a named list. A base random number seed is given, and the seed is set to this plus i for core number i. The total number of repetitions is split as evenly as possible across cores so that the per-core counts sum to the desired total. runifChanged is again used, to avoid re-running the simulations if no inputs have changed.
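The pattern runParallel implements can be sketched with the base parallel package (a simplified stand-in, not runParallel itself; mclapply forks, so on Windows it must be run with mc.cores=1):

```r
library(parallel)
runAll <- function(run1, reps, seed, cores = max(1, detectCores() - 1)) {
  per   <- rep(reps %/% cores, cores)       # balanced split of reps over cores
  extra <- seq_len(reps %% cores)
  per[extra] <- per[extra] + 1
  res <- mclapply(seq_len(cores), function(i) {
    set.seed(seed + i)                      # per-core seed = base seed + core number
    run1(per[i], showprogress = FALSE, core = i)
  }, mc.cores = cores)
  # Recombine per-core batches by row-binding each named list element
  nm <- names(res[[1]])
  setNames(lapply(nm, function(j) do.call(rbind, lapply(res, `[[`, j))), nm)
}
```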

# Load runParallel function from github
getRs('runParallel.r', put='source')

# Function to do simulations on one core
run1 <- function(reps, showprogress, core) {
  cof <- matrix(NA, nrow=reps, ncol=3,
                dimnames=list(NULL, c('a', 'b1', 'b2')))
  for(i in 1 : reps) {
    y <- sample(0:1, n, replace=TRUE)
    f <- lrm(y ~ X)
    cof[i, ] <- coef(f)
  }
  list(coef=cof)
}
# Debug one core run, with only 3 iterations
n    <- 300
seed <- 3
set.seed(seed)
X    <- cbind(x1=runif(n), x2=runif(n))  # condition on covariates
run1(3)
$coef
              a         b1        b2
[1,] -0.5455330  0.9572896 0.2215974
[2,] -0.2459193  0.3081405 0.1284809
[3,] -0.1391344 -0.2668562 0.1792319
nsim <- 5000
g <- function() runParallel(run1, reps=nsim, seed=seed)
Coefs <- runifChanged(g, X, run1, nsim, seed)
dim(Coefs)
[1] 5000    3
apply(Coefs, 2, mean)
            a            b1            b2 
 0.0020121803 -0.0007277216 -0.0003258610 

Computing Environment

The following output is created by the command markupSpecs$html$session(), where markupSpecs is defined in the Hmisc package.

 R version 4.2.0 (2022-04-22)
 Platform: x86_64-pc-linux-gnu (64-bit)
 Running under: Pop!_OS 21.10
 
 Matrix products: default
 BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
 LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
 
 attached base packages:
 [1] stats     graphics  grDevices utils     datasets  methods   base     
 
 other attached packages:
 [1] rms_6.3-1         SparseM_1.81      data.table_1.13.6 Hmisc_4.7-1      
 [5] ggplot2_3.3.3     Formula_1.2-4     survival_3.2-13   lattice_0.20-45  
 
To cite R in publications use:

R Core Team (2022). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/.

To cite the Hmisc package in publications use:

Harrell Jr F (2022). Hmisc: Harrell Miscellaneous. R package version 4.7-1, https://hbiostat.org/R/Hmisc/.

To cite the rms package in publications use:

Harrell Jr FE (2022). rms: Regression Modeling Strategies. https://hbiostat.org/R/rms/, https://github.com/harrelfe/rms.

To cite the data.table package in publications use:

Dowle M, Srinivasan A (2020). data.table: Extension of 'data.frame'. R package version 1.13.6, https://CRAN.R-project.org/package=data.table.

To cite the ggplot2 package in publications use:

Wickham H (2016). ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. ISBN 978-3-319-24277-4, https://ggplot2.tidyverse.org.

To cite the survival package in publications use:

Therneau T (2021). A Package for Survival Analysis in R. R package version 3.2-13, https://CRAN.R-project.org/package=survival.